library("knitr")
library("reprex")
library("tidyverse")

1 Getting ready

1.1 Installing R

Go to this link to download R: https://cran.rstudio.com/

Select the version that works for your operating system, and download the latest release (R-3.6.0 at the time of writing).

Figure 1.1: Download R.

Once you’ve downloaded R, install it following the instructions on the screen.

1.2 Installing R Studio

Go to this link to download R Studio: https://www.rstudio.com/products/rstudio/download/#download

Then download the version that works for your operating system.

Figure 1.2: Download R Studio.

Once you’ve downloaded R Studio, install it following the instructions on the screen.

2 Why R?

3 Setting things up

3.1 R Studio

  • A great integrated development environment (IDE) in which you can do all your R coding.
  • let’s change some settings first
  • R Studio cheatsheet

Figure 3.1: General preferences.

Make sure that:

  • Restore .RData into workspace at startup is unselected
  • Save workspace to .RData on exit is set to Never

Figure 3.2: Code window preferences.

Make sure that:

  • Soft-wrap R source files is selected

This way you don’t have to scroll horizontally. At the same time, avoid writing long single lines of code. For example, instead of writing code like so:

ggplot(data = diamonds, aes(x = cut, y = price)) +
  stat_summary(fun.y = "mean", geom = "bar", color = "black", fill = "lightblue", width = 0.85) +
  stat_summary(fun.data = "mean_cl_boot", geom = "linerange", size = 1.5) +
  labs(title = "Price as a function of quality of cut", subtitle = "Note: The price is in US dollars", tag = "A", x = "Quality of the cut", y = "Price")

You may want to write it this way instead:

ggplot(data = diamonds, aes(x = cut, y = price)) +
  # display the means
  stat_summary(fun.y = "mean",
               geom = "bar",
               color = "black",
               fill = "lightblue",
               width = 0.85) +
  # display the error bars
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange",
               size = 1.5) +
  # change labels
  labs(title = "Price as a function of quality of cut",
       subtitle = "Note: The price is in US dollars", # we might want to change this later
       tag = "A",
       x = "Quality of the cut",
       y = "Price")

This makes it much easier to see what’s going on, and you can easily add comments to individual lines of code.

R Studio makes it easy to write nice code. It figures out where to put the next line of code when you press ENTER. And if things ever get messy, just select the code of interest and hit cmd + i (ctrl + i on Windows and Linux) to re-indent it.

Here are some more tips on how to write nice code in R:

3.2 Getting help

There are three simple ways to get help in R. You can put a ? in front of the function you’d like to learn more about, use the help() function, or press F1 (see the tip below).

?print
help("print")

Tip: To see the help file, hover over a function (or dataset) with the mouse (or select the text) and then press F1.

I recommend using F1 to get to help files – it’s the fastest way!

R help files can sometimes look a little cryptic. Most R help files have the following sections (copied from here):


Title: A one-sentence overview of the function.

Description: An introduction to the high-level objectives of the function, typically about one paragraph long.

Usage: A description of the syntax of the function (in other words, how the function is called). This is where you find all the arguments that you can supply to the function, as well as any default values of these arguments.

Arguments: A description of each argument. Usually this includes a specification of the class (for example, character, numeric, list, and so on). This section is an important one to understand, because arguments are frequently a cause of errors in R.

Details: Extended details about how the function works, provides longer descriptions of the various ways to call the function (if applicable), and a longer discussion of the arguments.

Value: A description of the class of the value returned by the function.

See also: Links to other relevant functions. In most of the R editors, you can click these links to read the Help files for these functions.

Examples: Worked examples of real R code that you can paste into your console and run.
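
Some of this information can also be pulled up directly in the console. Here is a minimal sketch that uses the base R function sd() as an illustrative choice (it is not one of the functions discussed above): args() prints what the Usage section shows, and formals() returns the arguments together with their default values.

```r
# Print the argument list of sd(), as in the "Usage" section of its help file
args(sd)

# formals() returns the arguments and their default values as a list
sd_arguments = formals(sd)

names(sd_arguments)   # the argument names: "x" and "na.rm"
sd_arguments$na.rm    # the default value of na.rm: FALSE
```

Calling example("sd") would additionally run the code from the Examples section of the help file.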


Here is the help file for the print() function:

Figure 3.3: Help file for the print() function.

3.3 Installing and maintaining packages

What makes R powerful is the large number of packages that have been written for R. You can install a new package like so:

install.packages("tidyverse")

You can also install multiple packages at the same time, by concatenating the package names using the c() function:

install.packages(c("tidyverse", "broom"))
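
If you rerun a script often, you may not want to reinstall packages that are already there. The following is one possible sketch, not the only way to do it: it compares the package names you need against installed.packages() and keeps only the missing ones. The variable names are made up for this illustration, and the install step is commented out so that running the snippet does not change your library.

```r
# Packages this script needs
packages = c("tidyverse", "broom")

# Names of all packages that are already installed
installed = rownames(installed.packages())

# Keep only the packages that are not installed yet
missing_packages = setdiff(packages, installed)
missing_packages

# Uncomment the next line to install whatever is missing
# if (length(missing_packages) > 0) install.packages(missing_packages)
```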

To make sure that your packages remain up to date, you can go to Tools > Check for Package Updates ... in R Studio.

Figure 3.4: Checking for package updates in R Studio.

You can then click Select All and then Install Updates.

Figure 3.5: Installing package updates in R Studio.

R Studio might ask you to restart your R session before updating the packages.

3.4 R Markdown

3.5 Some general advice

  • naming functions and files
  • always use relative paths, so that stuff also works on other people’s computers
  • load all the packages at the top of the script
  • make sure that a script can be executed from top to bottom
  • project management (folder structure)
  • don’t write past the vertical rule in code blocks
  • learning keyboard shortcuts: Tools > Keyboard Shortcuts Help
  • how R handles functions with arguments (order matters if arguments aren’t named)
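
To illustrate the last point about arguments, here is a small sketch. The function power() is a hypothetical helper defined only for this example. Unnamed arguments are matched by position, so swapping them changes the result. Named arguments can be given in any order.

```r
# Hypothetical helper, defined only for this illustration
power = function(base, exponent) {
  base^exponent
}

# Unnamed arguments are matched by position
by_position = power(2, 3)  # base = 2, exponent = 3, so 2^3 = 8
swapped = power(3, 2)      # base = 3, exponent = 2, so 3^2 = 9

# Named arguments can be supplied in any order
by_name = power(exponent = 3, base = 2)  # again 2^3 = 8
```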

3.6 R syntax

4 Doing stuff

4.1 Loading packages

The order in which packages in R are loaded matters!

You can refer to a function from a specific package by putting the package name and :: in front of the function name. For example, MASS::select() uses the select() function from the MASS package, while dplyr::select() uses the function of the same name from the dplyr package.

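Always load library("tidyverse") last. When two packages contain functions with the same name, the package that was loaded later masks the one that was loaded earlier, and the tidyverse functions are the ones that are used most frequently.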
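
Masking can be tried out without installing anything. The sketch below is only an analogy: instead of loading two packages, it defines a local function called sd() that hides the sd() function from the stats package (which ships with R). The package::function() form still reaches the original.

```r
# A local function named sd() masks stats::sd(), much like a package
# loaded later masks same-named functions from packages loaded earlier
sd = function(x) {
  "masked!"
}

values = c(2, 4, 4, 4, 5, 5, 7, 9)

masked_result = sd(values)           # calls the local definition
explicit_result = stats::sd(values)  # always calls the version in stats

masked_result
explicit_result

# Remove the local definition so that sd() refers to stats::sd() again
rm(sd)
```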

4.2 Importing data

df.data = read_csv(file = "../../data/top2018songs.csv") %>% 
  mutate(rank = 1:nrow(.))

The quickest way to take a look at your data is to hover your mouse over a variable of a data frame, and press F2.
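
You can also inspect a data frame from the console. Here is a minimal sketch that uses the built-in mtcars data set as a stand-in, since it works without the song file. The name df.example is made up for this illustration, and the same functions can be applied to df.data.

```r
# Built-in data set, used here as a stand-in for df.data
df.example = mtcars

# Number of rows and columns
dimensions = dim(df.example)
dimensions

# First three rows
head(df.example, n = 3)

# Type of each column, together with its first few values
str(df.example)

# Summary statistics for a single column
summary(df.example$mpg)
```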

| column | description |
|---|---|
| id | Spotify URI of the song |
| name | Name of the song |
| artists | Artist(s) of the song |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | Predicts whether a track contains no vocals. ‘Ooh’ and ‘aah’ sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly ‘vocal’. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | The duration of the track in milliseconds. |
| time_signature | An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). |

4.3 Data visualization

4.3.1 How not to visualize data

We should always take a look at the data first.

include_graphics("../../figures/plots/bad_plot1.png")

Figure 4.1: A not so good plot.

include_graphics("../../figures/plots/bad_plot2.jpg")

Figure 4.2: Another could-be-improved plot.

This second plot reminded me of the following:

include_graphics("../../figures/plots/correlation_aint_causation.png")

Figure 4.3: Correlation is not causation.

Just because two lines look similar doesn’t mean that anything interesting is going on, and it certainly doesn’t mean that the two phenomena represented by the lines are causally connected. For more inspiration, check out this site: https://www.tylervigen.com/spurious-correlations.

4.3.2 Why you should always visualize your data first

Figure 4.4: The Datasaurus Dozen. While different in appearance, each dataset has the same summary statistics to two decimal places (mean, standard deviation, and Pearson’s correlation).

The data sets in Figure 4.4 all share the same summary statistics. Clearly, the data sets are not the same though.

Tip: Always plot the data first!
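
You can check this kind of thing yourself. The sketch below does not use the Datasaurus data. It uses Anscombe's quartet instead, a related classic example that ships with R as the anscombe data frame. Its four small data sets have matching means and correlations (to two decimal places) but look very different when plotted. The variable names are made up for this illustration.

```r
# anscombe ships with R: columns x1..x4 and y1..y4 form four data sets
x_columns = paste0("x", 1:4)
y_columns = paste0("y", 1:4)

# Mean of y for each of the four data sets
y_means = sapply(y_columns, function(y) mean(anscombe[[y]]))

# Pearson correlation between x and y for each of the four data sets
correlations = mapply(function(x, y) cor(anscombe[[x]], anscombe[[y]]),
                      x_columns, y_columns)

round(y_means, 2)       # the four means agree to two decimal places
round(correlations, 2)  # and so do the four correlations

# The scatter plots nevertheless look completely different
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[x_columns[i]]], anscombe[[y_columns[i]]],
       xlab = x_columns[i], ylab = y_columns[i])
}
```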

Here is the paper from which I took Figure 4.4. It explains how the figures were generated and shows more examples of how summary statistics and some kinds of plots are insufficient to get a good sense of what’s going on in the data.

include_graphics("../../figures/plots/box_violin.gif")

Figure 4.5: Boxplots can be misleading.

4.3.3 Visualizing data using ggplot2

ggplot(data = df.data,
       mapping = aes(x = danceability,
                     y = valence)) + 
  geom_point()

ggplot(data = df.data,
       mapping = aes(x = danceability,
                     y = valence)) + 
  geom_point() +
  geom_smooth(method = "lm")

ggplot(data = df.data,
       mapping = aes(x = key,
                     y = valence)) + 
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange") +
  stat_summary(fun.y = "mean",
               geom = "point",
               size = 3)

ggplot()

4.4 Data manipulation

4.4.1 Data transformation

4.4.2 Data wrangling

4.5 Statistics

  • linear model lm()
  • linear mixed effects models lmer()
  • Bayesian models brm() (using library("brms"))
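
As a first taste, here is a minimal sketch of a linear model fitted with lm(). It uses the built-in mtcars data set rather than the song data, and the choice of variables (fuel efficiency mpg predicted by weight wt) is only for illustration. lmer() (from the lme4 package) and brm() also describe models with a formula, so the basic pattern carries over.

```r
# Fit a linear model: fuel efficiency (mpg) as a function of weight (wt)
fit = lm(formula = mpg ~ wt, data = mtcars)

# Estimated intercept and slope
fit_coefficients = coef(fit)
fit_coefficients

# Full model summary (standard errors, t-values, R-squared, ...)
summary(fit)
```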

4.6 Saving data

4.7 Help others help you

  • making reproducible examples
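
A reproducible example needs a small piece of data that others can recreate on their own machine. The following is one possible sketch using only base R. dput() prints R code that rebuilds an object, so you can paste that code into a question. The data frame and the variable names are made up for this illustration.

```r
# A small data frame that is enough to show the problem
df.example = data.frame(x = c(1, 2, 3),
                        y = c("a", "b", "c"))

# dput() prints R code that recreates the object
dput(df.example)

# Check: capturing that code and running it gives back the same data frame
code = capture.output(dput(df.example))
df.rebuilt = eval(parse(text = code))
all.equal(df.example, df.rebuilt)
```

The reprex package, loaded at the top of this document, builds on the same idea: it runs a snippet in a fresh session and formats the code together with its output for sharing.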

5 Where can I learn more?

5.1 Free online books